Goto

Collaborating Authors

 combinatorial action


Reinforcement Learning with Combinatorial Actions: An Application to Vehicle Routing

Neural Information Processing Systems

Value-function-based methods have long played an important role in reinforcement learning. However, finding the best next action given a value function of arbitrary complexity is nontrivial when the action space is too large for enumeration. We develop a framework for value-function-based deep reinforcement learning with a combinatorial action space, in which the action selection problem is explicitly formulated as a mixed-integer optimization problem. As a motivating example, we present an application of this framework to the capacitated vehicle routing problem (CVRP), a combinatorial optimization problem in which a set of locations must be covered by a single vehicle with limited capacity. On each instance, we model an action as the construction of a single route, and consider a deterministic policy which is improved through a simple policy iteration algorithm. Our approach is competitive with other reinforcement learning methods and achieves an average gap of 1.7% with state-of-the-art OR methods on standard library instances of medium size.


Reinforcement learning with combinatorial actions for coupled restless bandits

arXiv.org Artificial Intelligence

Reinforcement learning (RL) has increasingly been applied to solve real-world planning problems, with progress in handling large state spaces and time horizons. However, a key bottleneck in many domains is that RL methods cannot accommodate large, combinatorially structured action spaces. In such settings, even representing the set of feasible actions at a single step may require a complex discrete optimization formulation. We leverage recent advances in embedding trained neural networks into optimization problems to propose SEQUOIA, an RL algorithm that directly optimizes for long-term reward over the feasible action space. Our approach embeds a Q-network into a mixed-integer program to select a combinatorial action in each timestep. Here, we focus on planning over restless bandits, a class of planning problems which capture many real-world examples of sequential decision making. RMAB, a broader class of restless bandits with combinatorial actions that cannot be decoupled across the arms of the restless bandit, requiring direct solving over the joint, exponentially large action space. Our approach significantly outperforms existing methods--which cannot address sequential planning and combinatorial selection simultaneously--by an average of 24.8% on these difficult instances. Reinforcement learning (RL) has made tremendous progress in recent years to solve a wide range of practical problems (Treloar et al., 2020; Marot et al., 2021; Silvestro et al., 2022; Degrave et al., 2022). While successful at dealing with large or infinite state spaces, RL struggles with discrete, combinatorial action spaces. This limitation is pertinent to many real-world sequential decisionmaking problems, where resource constraints frequently lead to combinatorial action spaces (Dulac-Arnold et al., 2020). Consider, for example, a sequential resource allocation problem in which public health workers are dispatched to visit patients. The workers each have a limited daily budget to maximize patient well-being. These requirements give rise to an exponentially large combinatorial action space to optimize over, even when the number of workers and patients is small.


Review for NeurIPS paper: Reinforcement Learning with Combinatorial Actions: An Application to Vehicle Routing

Neural Information Processing Systems

Summary and Contributions: - New reinforcement learning algorithm to solve capacitated vehicle routing problem. However, there are some observations for the Machine Learning community that are of some interest. There is an enduring interest in the reinforcement learning community to investigate ways in which reinforcement learning technologies can play a role in hard combinatorial optimisation settings. Here, following the cited 2018 NeurIPS publication by Nazari et al., the authors of the submitted manuscript develop and evaluate a novel reinforcement learning approach for the capacitated vehicle routing problem (CVRP). The CVRP is a hard combinatorial problem class that includes the Travelling Sales Person problem.


Review for NeurIPS paper: Reinforcement Learning with Combinatorial Actions: An Application to Vehicle Routing

Neural Information Processing Systems

The paper proposes a novel reinforcement learning approach to solving the capacitated vehicle routing problem. It involves learning a value function and solving a TSP for the prizing problem. Reviewers agree that the proposed approach is novel and interesting. One reviewer is sceptical of the work because of doubts about the performance achievable with the proposed approach. However, the ideas presented still deserve to be presented at NeurIPS, with the hope of bringing advances to this research area.


Reinforcement Learning with Combinatorial Actions: An Application to Vehicle Routing

Neural Information Processing Systems

Value-function-based methods have long played an important role in reinforcement learning. However, finding the best next action given a value function of arbitrary complexity is nontrivial when the action space is too large for enumeration. We develop a framework for value-function-based deep reinforcement learning with a combinatorial action space, in which the action selection problem is explicitly formulated as a mixed-integer optimization problem. As a motivating example, we present an application of this framework to the capacitated vehicle routing problem (CVRP), a combinatorial optimization problem in which a set of locations must be covered by a single vehicle with limited capacity. On each instance, we model an action as the construction of a single route, and consider a deterministic policy which is improved through a simple policy iteration algorithm. Our approach is competitive with other reinforcement learning methods and achieves an average gap of 1.7% with state-of-the-art OR methods on standard library instances of medium size.


Combinatorial Neural Bandits

arXiv.org Artificial Intelligence

We consider a contextual combinatorial bandit problem where in each round a learning agent selects a subset of arms and receives feedback on the selected arms according to their scores. The score of an arm is an unknown function of the arm's feature. Approximating this unknown score function with deep neural networks, we propose algorithms: Combinatorial Neural UCB ($\texttt{CN-UCB}$) and Combinatorial Neural Thompson Sampling ($\texttt{CN-TS}$). We prove that $\texttt{CN-UCB}$ achieves $\tilde{\mathcal{O}}(\tilde{d} \sqrt{T})$ or $\tilde{\mathcal{O}}(\sqrt{\tilde{d} T K})$ regret, where $\tilde{d}$ is the effective dimension of a neural tangent kernel matrix, $K$ is the size of a subset of arms, and $T$ is the time horizon. For $\texttt{CN-TS}$, we adapt an optimistic sampling technique to ensure the optimism of the sampled combinatorial action, achieving a worst-case (frequentist) regret of $\tilde{\mathcal{O}}(\tilde{d} \sqrt{TK})$. To the best of our knowledge, these are the first combinatorial neural bandit algorithms with regret performance guarantees. In particular, $\texttt{CN-TS}$ is the first Thompson sampling algorithm with the worst-case regret guarantees for the general contextual combinatorial bandit problem. The numerical experiments demonstrate the superior performances of our proposed algorithms.